[PPL][analytics-engine-compatibility] Lower isempty/isblank to CHAR_LENGTH = 0 instead of multiset IS_EMPTY#5439
Conversation
…_EMPTY PPLFuncImpTable was emitting SqlStdOperatorTable.IS_EMPTY against string operands as part of its isempty(x) / isblank(x) lowering. That operator is the SQL:2003 multiset emptiness predicate — its OperandTypeChecker is COLLECTION_OR_MAP and its enumerable-runtime binding calls java.util.Collection.isEmpty() reflectively. Passing a string slipped through only because RexBuilder.makeCall bypasses the operand checker and Calcite's Linq4j codegen emits a bare `target.isEmpty()` that happens to bind to String.isEmpty() at Janino compile time. Outside of the script-pushdown path, the abuse breaks: any backend that does not run RexNodes through Calcite's enumerable codegen — most notably Substrait emitters such as the analytics-engine route — has no isthmus mapping for SqlStdOperatorTable.IS_EMPTY and rejects the plan with "unresolved function reference IS EMPTY". Replace the lowering with the predicate that actually matches the doc: isempty(x) -> OR(IS_NULL(x), CHAR_LENGTH(x) = 0) isblank(x) -> OR(IS_NULL(x), CHAR_LENGTH(TRIM(BOTH ' ' FROM x)) = 0) CHAR_LENGTH is a standard string operator with explicit type semantics, backed by SqlFunctions.charLength(String) in Calcite's enumerable runtime, and it round-trips through Substrait's default extension catalog. Same observable behavior on existing PPL paths; correct behavior on Substrait-based paths. Also remove PredicateAnalyzer.containIsEmptyFunction. It existed solely to abort DSL pushdown when it saw the OR(IS_NULL, IS_EMPTY) shape because the SHOULD branches would NPE on null operands evaluating .isEmpty() through the script. The new shape produces OR(IS_NULL, CHAR_LENGTH = 0); the IS_NULL arm pushes down to must_not.exists, the CHAR_LENGTH arm to a script that returns null (not NPE) for null fields, and the SHOULD union answers correctly. Test plan: - CalcitePPLConditionBuiltinFunctionIT.testIsEmpty / testIsBlank pass. - CalciteExplainIT.testExplainIsEmpty / testExplainIsBlank / testExplainIsEmptyOrOthers pass with regenerated golden plans for both pushdown-enabled and no-pushdown variants. - PredicateAnalyzerTest passes, including the unrelated equals_scriptPushDown_Struct test that still uses SqlStdOperatorTable.IS_EMPTY for its legitimate map-emptiness semantics. Signed-off-by: Songkan Tang <songkant@amazon.com>
The previous test (isNullOr_ScriptPushDown) only asserted that the builder
string contained the opensearch_compounded_script lang marker. With the
isempty -> OR(IS_NULL, CHAR_LENGTH=0) lowering, the IS_NULL arm now pushes
down as a native bool.must_not.exists clause and only the CHAR_LENGTH=0
arm remains a script — so the substring assertion still passed without
verifying the actual structure. Renamed and rewritten to assert:
- Top-level builder is a BoolQueryBuilder with exactly two should arms
and no must / top-level must_not clauses.
- Arm 0 is a BoolQueryBuilder whose mustNot wraps an ExistsQueryBuilder
(the IS_NULL lowering).
- Arm 1 is a ScriptQueryBuilder using the opensearch_compounded_script
lang (the CHAR_LENGTH=0 lowering).
Also documents inline why the prior containIsEmptyFunction NPE-prevention
detector is no longer needed: CHAR_LENGTH(null) follows Calcite's STRICT
null policy and returns null instead of NPE, so DSL's non-short-circuiting
should evaluation is safe.
Signed-off-by: Songkan Tang <songkant@amazon.com>
PR Reviewer Guide 🔍(Review updated until commit 217d477)Here are some key observations to aid the review process:
|
PR Code Suggestions ✨Latest suggestions up to 217d477
Previous suggestionsSuggestions up to commit 7177ad6
|
Signed-off-by: Songkan Tang <songkant@amazon.com>
|
Persistent review updated to latest commit 217d477 |
…oute PPL's PPLFuncImpTable lowers isempty(x) / isblank(x) at RelNode build time (see opensearch-project/sql#5439): isempty(x) -> OR(IS_NULL(x), CHAR_LENGTH(x) = 0) isblank(x) -> OR(IS_NULL(x), CHAR_LENGTH(TRIM(BOTH ' ' FROM x)) = 0) The constituent operators all round-trip to DataFusion via the standard Substrait extension catalog already — IS_NULL and the boolean connectives are core Substrait expression types, CHAR_LENGTH maps to DataFusion's length via ADDITIONAL_SCALAR_SIGS, EQUALS and TRIM are baseline SQL. The only outstanding gap on the analytics-engine route is that OpenSearchProjectRule.BASELINE_SCALAR_OPS lacked AND / OR / NOT, so the project rule fails capability resolution on the OR connective before it ever recurses to the per-leaf operators. The filter rule already recurses into AND / OR / NOT structurally; this commit brings the project rule to parity. With OR baseline, the per-leaf capability resolution drives down through IS_NULL and the EQUALS(CHAR_LENGTH, 0) arm — both already supported. Verification: - 6 REST IT cases in StringScalarFunctionsIT exercise the full PPL -> Calcite -> Substrait -> DataFusion path against the calcs dataset: isempty on non-empty / null / computed-empty fields and isblank on all-whitespace / non-blank / null inputs. All pass against a runTask cluster on /_analytics/ppl. Signed-off-by: Songkan Tang <songkant@amazon.com>
Description
PPL's
isempty(x)/isblank(x)were lowered inPPLFuncImpTableusingSqlStdOperatorTable.IS_EMPTYagainst string operands. That operator is the SQL:2003 multiset emptiness predicate — itsOperandTypeCheckerisOperandTypes.COLLECTION_OR_MAPand its enumerable-runtime binding callsjava.util.Collection.isEmpty()reflectively.Passing a string slipped through only because:
RexBuilder.makeCallbypasses the operand checker.target.isEmpty()(a bare method-name call), which Janino then resolves against the actual operand type.String.isEmpty()happens to exist with the same signature, so the script-path runtime works by coincidence.This shape is fragile:
SqlStdOperatorTable.IS_EMPTY, and the plan is rejected with"unresolved function reference IS EMPTY".PredicateAnalyzerhad to special-case-block DSL pushdown for theOR(IS_NULL, IS_EMPTY)shape because OpenSearch's non-short-circuitingbool.shouldwould NPE evaluatingname.isEmpty()on a null operand. So the entire OR was wrapped in a single script_query — a strict pessimization compared to a mixed bool.should.Issues Resolved
This PR replaces the lowering with the predicate that actually matches the documented semantics:
CHAR_LENGTHis a standard string operator with explicit type semantics, backed bySqlFunctions.charLength(String)in Calcite's enumerable runtime, and round-trips through Substrait's default extension catalog. Same observable behavior on existing PPL paths; correct behavior on Substrait-based paths.The
containIsEmptyFunctionguard inPredicateAnalyzer.andOris removed:CHAR_LENGTH(null)follows Calcite'sNullPolicy.STRICTand returns null rather than NPE, so DSL's non-short-circuitingshouldevaluation is safe even when the field is null. This also unlocks better pushdown — theIS_NULLarm is now a nativebool.must_not.existsclause and only theCHAR_LENGTH=0arm remains a script. Null docs no longer touch the script engine.The
isNullOr_ScriptPushDowntest was renamed toisEmpty_pushesIsNullArmAsExistsAndCharLengthArmAsScriptand rewritten to assert the actual structural shape of the pushdown (2-arm bool.should with the expected types per arm) instead of just substring-matching the script lang marker.Plan changes are reflected in the regenerated explain golden files for both
calcite/andcalcite_no_pushdown/variants.Check List
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.